Abstract


We present AutoExtend, a system to learn embeddings for synsets and lexemes. It is flexible in that it can take any word embeddings as input and does not need an additional training corpus. The synset/lexeme embeddings obtained live in the same vector space as the word embeddings. A sparse tensor formalization guarantees efficiency and parallelizability. We use WordNet as a lexical resource, but AutoExtend can be easily applied to other resources like Freebase. AutoExtend achieves state-of-the-art performance on word similarity and word sense disambiguation tasks.


1 Introduction 


Unsupervised methods for word embeddings (also called “distributed word representations”) have become popular in natural language processing (NLP). These methods only need very large corpora as input to create sparse representations (e.g., based on local collocations) and project them into a lower dimensional dense vector space. Examples of word embeddings are SENNA (Collobert and Weston, 2008), the hierarchical log-bilinear model (Mnih and Hinton, 2009), word2vec (Mikolov et al., 2013c) and GloVe (Pennington et al., 2014). However, there are many other resources that are undoubtedly useful in NLP, including lexical resources like WordNet and Wiktionary and knowledge bases like Wikipedia and Freebase. We will simply refer to these as resources in the rest of the paper. Our goal is to enrich these valuable resources with embeddings for those data types that are not words; e.g., we want to enrich WordNet with embeddings for synsets and lexemes. A synset is a set of synonyms that are interchangeable in some context. A lexeme pairs a particular spelling or pronunciation with a particular meaning, i.e., a lexeme is a conjunction of a word and a synset. Our premise is that many NLP applications will benefit if the non-word data types of resources - e.g., synsets in WordNet - are also available as embeddings. For example, in machine translation, enriching and improving translation dictionaries (cf. Mikolov et al. (2013b)) would benefit from these embeddings because they would enable us to create an enriched dictionary for word senses. Generally, our premise is that the arguments for the utility of embeddings for word forms should carry over to the utility of embeddings for other data types like synsets in WordNet.

The insight underlying the method we propose is that the constraints of a resource can be formalized as constraints on embeddings and then allow us to extend word embeddings to embeddings of other data types like synsets. For example, the hyponymy relation in WordNet can be formalized as such a constraint.

The advantage of our approach is that it decouples embedding learning from the extension of embeddings to non-word data types in a resource. If somebody comes up with a better way of learning embeddings, these embeddings become immediately usable for resources. And we do not rely on any specific properties of embeddings that make them usable in some resources, but not in others.

An alternative to our approach is to train embeddings on annotated text, e.g., to train synset embeddings on corpora annotated with synsets. However, successful embedding learning generally requires very large corpora and sense labeling is too expensive to produce corpora of such a size.

Another alternative to our approach is to add up all word embedding vectors related to a particular node in a resource; e.g., to create the synset vector of lawsuit in WordNet, we can add the word vectors of the three words that are part of the synset (lawsuit, suit, case). We will call this approach naive and use it as a baseline (S_naive in Table 3).
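The naive baseline described above can be sketched in a few lines. This is only an illustration: the three-dimensional vectors are made-up toy values, and the helper name `naive_synset_vector` is hypothetical, not part of any released code.

```python
import numpy as np

# Hypothetical toy word embeddings (values invented for illustration only).
word_vectors = {
    "lawsuit": np.array([0.9, 0.1, 0.0]),
    "suit":    np.array([0.5, 0.5, 0.2]),
    "case":    np.array([0.4, 0.0, 0.6]),
}

def naive_synset_vector(members, vectors):
    """Sum the vectors of all synset members that have an embedding."""
    known = [vectors[w] for w in members if w in vectors]
    if not known:
        raise ValueError("no member word has an embedding")
    return np.sum(known, axis=0)

# Naive vector for the synset containing lawsuit, suit and case.
s_lawsuit = naive_synset_vector(["lawsuit", "suit", "case"], word_vectors)
```

Note that this baseline gives the full vector of an ambiguous word such as suit to every synset it belongs to, which is the behaviour the lexeme decomposition is meant to avoid.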

We will focus on WordNet (Fellbaum, 1998) in this paper, but our method - based on a formalization that exploits the constraints of a resource for extending embeddings from words to other data types - is broadly applicable to other resources including Wikipedia and Freebase.

A word in WordNet can be viewed as a composition of several lexemes. Lexemes from different words together can form a synset. When a synset is given, it can be decomposed into its lexemes. And these lexemes then join to form words. These observations are the basis for the formalization of the constraints encoded in WordNet that will be presented in the next section: we view words as the sum of their lexemes and, analogously, synsets as the sum of their lexemes.


Another motivation for our formalization stems from the analogy calculus developed by Mikolov et al. (2013a), which can be viewed as a group theory formalization of word relations: we have a set of elements (our vectors) and an operation (addition) satisfying the properties of a mathematical group, in particular, associativity and invertibility. For example, you can take the vector of king, subtract the vector of man and add the vector of woman to get a vector near queen. In other words, you remove the properties of man and add the properties of woman. We can also see the vector of king as the sum of the vector of man and the vector of a gender-neutral ruler. The next thing to notice is that this does not only work for words that combine several properties, but also for words that combine several senses. The vector of suit can be seen as the sum of a vector representing lawsuit and a vector representing business suit. AutoExtend is designed to take word vectors as input and unravel the word vectors to the vectors of their lexemes. The lexeme vectors will then give us the synset vectors.


The main contributions of this paper are: (i) We present AutoExtend, a flexible method that extends word embeddings to embeddings of synsets and lexemes. AutoExtend is completely general in that it can be used for any set of embeddings and for any resource that imposes constraints of a certain type on the relationship between words and other data types. (ii) We show that AutoExtend achieves state-of-the-art word similarity and word sense disambiguation (WSD) performance. (iii) We publish the AutoExtend code for extending word embeddings to other data types, the lexeme and synset embeddings and the software to replicate our WSD evaluation.

This paper is structured as follows. Section 2 introduces the model, first as a general tensor formulation, then as a matrix formulation making additional assumptions. In Section 3, we describe data, experiments and evaluation. We analyze AutoExtend in Section 4 and give a short summary on how to extend our method to other resources in Section 5. Section 6 discusses related work.

2 Model 

We are looking for a model that extends standard 
embeddings for words to embeddings for the other 
two data types in WordNet: synsets and lexemes. 
We want all three data types - words, lexemes, 
synsets - to live in the same embedding space. 

The basic premise of our model is: (i) words are sums of their lexemes and (ii) synsets are sums of their lexemes. We refer to these two premises as synset constraints. For example, the embedding of the word bloom is a sum of the embeddings of its two lexemes bloom(organ) and bloom(period); and the embedding of the synset flower-bloom-blossom(organ) is a sum of the embeddings of its three lexemes flower(organ), bloom(organ) and blossom(organ).

The synset constraints can be argued to be the simplest possible relationship between the three WordNet data types. They can also be motivated by the way many embeddings are learned from corpora - for example, the counts in vector space models are additive, supporting the view of words as the sum of their senses. The same assumption is frequently made; for example, it underlies the group theory formalization of analogy discussed in Section 1.

We denote word vectors as $w^{(i)} \in \mathbb{R}^n$, synset vectors as $s^{(j)} \in \mathbb{R}^n$, and lexeme vectors as $l^{(i,j)} \in \mathbb{R}^n$. $l^{(i,j)}$ is that lexeme of word $w^{(i)}$ that is a member of synset $s^{(j)}$. We set lexeme vectors $l^{(i,j)}$ that do not exist to zero. For example, the non-existing lexeme flower(truck) is set to zero. We can then formalize our premise that the two constraints (i) and (ii) hold as follows:

$$w^{(i)} = \sum_j l^{(i,j)} \qquad (1)$$

$$s^{(j)} = \sum_i l^{(i,j)} \qquad (2)$$

These two equations are underspecified. We therefore introduce the matrix $E^{(i,j)} \in \mathbb{R}^{n \times n}$:

$$l^{(i,j)} = E^{(i,j)} w^{(i)} \qquad (3)$$


We make the assumption that the dimensions in Eq. 3 are independent of each other, i.e., $E^{(i,j)}$ is a diagonal matrix. Our motivation for this assumption is: (i) This makes the computation technically feasible by significantly reducing the number of parameters and by supporting parallelism. (ii) Treating word embeddings on a per-dimension basis is a frequent design choice (e.g., Kalchbrenner et al. (2014)). Note that we allow $E^{(i,j)}_{d,d} < 0$ and in general the distribution weights for each dimension (diagonal entries of $E^{(i,j)}$) will be different. Our assumption can be interpreted as word $w^{(i)}$ distributing its embedding activations to its lexemes on each dimension separately. Therefore, Eqs. 1-2 can be written as follows:

$$w^{(i)} = \sum_j E^{(i,j)} w^{(i)} \qquad (4)$$

$$s^{(j)} = \sum_i E^{(i,j)} w^{(i)} \qquad (5)$$

Note that from Eq. 4 it directly follows that:

$$\sum_j E^{(i,j)} = I_n \quad \forall i \qquad (6)$$

with $I_n$ being the identity matrix.
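To make Eqs. 3-6 concrete, here is a minimal numerical sketch, assuming a word with exactly two lexemes and a four-dimensional embedding space. All values are random toy data and the variable names are illustrative only.

```python
import numpy as np

n = 4                                  # toy embedding dimensionality
rng = np.random.default_rng(0)
w_suit = rng.normal(size=n)            # hypothetical word vector w^(i)

# Diagonal entries of E^(i,j) for the two lexemes of the word.
# Entries may be negative; only their sum over j is constrained (Eq. 6).
e_lawsuit = rng.normal(size=n)
e_clothes = 1.0 - e_lawsuit            # forces sum_j E^(i,j) = I_n

# Eq. 3: multiplying by a diagonal matrix is an element-wise product.
l_lawsuit = e_lawsuit * w_suit
l_clothes = e_clothes * w_suit

# Eq. 1 / Eq. 4: the lexeme vectors add up to the word vector again.
reconstructed = l_lawsuit + l_clothes
```

Because each $E^{(i,j)}$ is diagonal, the word vector is split independently on every dimension, which is what later allows one small problem per dimension.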

Let $W$ be a $|W| \times n$ matrix where $n$ is the dimensionality of the embedding space, $|W|$ is the number of words and each row $w^{(i)}$ is a word embedding; and let $S$ be a $|S| \times n$ matrix where $|S|$ is the number of synsets and each row $s^{(j)}$ is a synset embedding. $W$ and $S$ can be interpreted as linear maps and a mapping between them is given by the rank 4 tensor $E \in \mathbb{R}^{|S| \times n \times |W| \times n}$. We can then write Eq. 5 as a tensor product:

$$S = E \otimes W \qquad (7)$$

while Eq. 6 states that

$$\sum_j E^{i,d_1}_{j,d_2} = 1 \quad \forall i, d_1, d_2 \qquad (8)$$

Additionally, there is no interaction between different dimensions, so $E^{i,d_1}_{j,d_2} = 0$ if $d_1 \neq d_2$. In other words, we are creating the tensor by stacking the diagonal matrices $E^{(i,j)}$ over $i$ and $j$. Another sparsity arises from the fact that many lexemes do not exist: $E^{i,d_1}_{j,d_2} = 0$ if $l^{(i,j)} = 0$; i.e., $l^{(i,j)} \neq 0$ only if word $i$ has a lexeme that is a member of synset $j$. To summarize the sparsity:

$$E^{i,d_1}_{j,d_2} = 0 \;\Leftarrow\; d_1 \neq d_2 \,\lor\, l^{(i,j)} = 0 \qquad (9)$$

2.1 Learning

We adopt an autoencoding framework to learn embeddings for lexemes and synsets. To this end, we view the tensor equation $S = E \otimes W$ as the encoding part of the autoencoder: the synsets are the encoding of the words. We define a corresponding decoding part that decodes the synsets into words as follows:

$$w^{(i)} = \sum_j \bar{l}^{(i,j)} \qquad (10)$$

In analogy to $E^{(i,j)}$, we introduce the diagonal matrix $D^{(j,i)}$:

$$\bar{l}^{(i,j)} = D^{(j,i)} s^{(j)} \qquad (11)$$

In this case, it is the synset that distributes itself to its lexemes. We can then rewrite Eq. 10 to:

$$w^{(i)} = \sum_j \bar{l}^{(i,j)} = \sum_j D^{(j,i)} s^{(j)} \qquad (12)$$

and we also get the equivalent of Eq. 6 for $D^{(j,i)}$:

$$\sum_i D^{(j,i)} = I_n \quad \forall j \qquad (13)$$

and in tensor notation:

$$W = D \otimes S \qquad (14)$$

Normalization and sparseness properties for the decoding part are analogous to the encoding part:

$$\sum_i D^{j,d_2}_{i,d_1} = 1 \quad \forall j, d_1, d_2 \qquad (15)$$

$$D^{j,d_2}_{i,d_1} = 0 \;\Leftarrow\; d_1 \neq d_2 \,\lor\, l^{(i,j)} = 0 \qquad (16)$$

We can state the learning objective of the autoencoder as follows:

$$\operatorname*{argmin}_{E,D} \; \| D \otimes E \otimes W - W \| \qquad (17)$$

under the conditions Eq. 8, 9, 15 and 16.

Now we have an autoencoder where input and output layers are the word embeddings and the hidden layer represents the synset vectors. A simplified version is shown in Figure 1. The tensors $E$ and $D$ have to be learned. They are rank 4 tensors with $|S| \cdot n \cdot |W| \cdot n$ entries each. However, we already discussed that they are very sparse, for two reasons: (i) We make the assumption that there is no interaction between dimensions. (ii) There are only few interactions between words and synsets (only when a lexeme exists). In practice, only the entries with $d_1 = d_2$ that belong to an existing lexeme have to be learned, which is technically feasible.
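A back-of-the-envelope count illustrates how much the two sparsity conditions save. The sketch below plugs in the item counts from Table 2 (after intersection with word2vec) and the 300 dimensions used in Section 3; treating these as the sizes of the learned tensor is an assumption made here for illustration.

```python
# Counts after intersection with word2vec, taken from Table 2.
num_words, num_synsets, num_lexemes = 54_570, 73_844, 106_167
n = 300  # embedding dimensionality used in Section 3

# A dense rank 4 tensor has |S| * n * |W| * n entries.
dense_entries = num_synsets * n * num_words * n

# With Eq. 9, only one entry per (existing lexeme, dimension) pair remains.
nonzero_entries = num_lexemes * n

ratio = nonzero_entries / dense_entries
print(f"dense: {dense_entries:.2e}  non-zero: {nonzero_entries:.2e}  ratio: {ratio:.1e}")
```

The same count applies to $D$, since $E$ and $D$ share their pattern of non-zero entries.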

2.2 Matrix formalization

Based on the assumption that each dimension is fully independent from other dimensions, a separate autoencoder for each dimension can be created and trained in parallel. Let $W \in \mathbb{R}^{|W| \times n}$ be a matrix where each row is a word embedding and $w^{(d)} = W_{\cdot,d}$ the $d$-th column of $W$, i.e., a vector that holds the $d$-th dimension of each word vector. In the same way, $s^{(d)} = S_{\cdot,d}$ holds the $d$-th dimension of each synset vector and $E^{(d)} = E_{\cdot,d,\cdot,d} \in \mathbb{R}^{|S| \times |W|}$. We can write $S = E \otimes W$ as:

$$s^{(d)} = E^{(d)} w^{(d)} \quad \forall d \qquad (18)$$

with $E^{(d)}_{j,i} = 0$ if $l^{(i,j)} = 0$. The decoding equation $W = D \otimes S$ takes this form:

$$w^{(d)} = D^{(d)} s^{(d)} \quad \forall d \qquad (19)$$

where $D^{(d)} = D_{\cdot,d,\cdot,d} \in \mathbb{R}^{|W| \times |S|}$ and $D^{(d)}_{i,j} = 0$ if $l^{(i,j)} = 0$. So $E$ and $D$ are symmetric in terms of non-zero elements. The learning objective becomes:

$$\operatorname*{argmin}_{E^{(d)}, D^{(d)}} \; \| D^{(d)} E^{(d)} w^{(d)} - w^{(d)} \| \quad \forall d \qquad (20)$$

2.3 Lexeme embeddings

The hidden layer $S$ of the autoencoder gives us synset embeddings. The lexeme embeddings are defined when transitioning from $W$ to $S$, or more explicitly by:

$$l^{(i,j)} = E^{(i,j)} w^{(i)} \qquad (21)$$

However, there is also a second lexeme embedding in AutoExtend when transitioning from $S$ to $W$:

$$\bar{l}^{(i,j)} = D^{(j,i)} s^{(j)} \qquad (22)$$

Aligning these two representations seems natural, so we impose the following lexeme constraints:

$$\operatorname*{argmin}_{E^{(i,j)}, D^{(j,i)}} \; \| E^{(i,j)} w^{(i)} - D^{(j,i)} s^{(j)} \| \quad \forall i, j \qquad (23)$$

This can also be expressed dimension-wise. The matrix formulation is given by:

$$\operatorname*{argmin}_{E^{(d)}, D^{(d)}} \; \| E^{(d)} \operatorname{diag}(w^{(d)}) - \big( D^{(d)} \operatorname{diag}(s^{(d)}) \big)^{T} \| \quad \forall d \qquad (24)$$

with $\operatorname{diag}(x)$ being a square matrix having $x$ on the main diagonal and vector $s^{(d)}$ defined by Eq. 18. While we try to align the embeddings, there are still two different lexeme embeddings. In all experiments reported in Section 3 we will use the average of both embeddings and in Section 4 we will analyze the weighting in more detail.

2.4 WN relations

Some WordNet synsets contain only a single word (lexeme). The autoencoder learns based on the synset constraints, i.e., lexemes being shared by different synsets (and also words); thus, it is difficult to learn good embeddings for single-lexeme synsets. To remedy this problem, we impose the constraint that synsets related by WordNet (WN) relations should have similar embeddings. Table 1 shows relations we used. WN relations are entered in a new matrix $R \in \mathbb{R}^{r \times |S|}$, where $r$ is the number of WN relation tuples. For each relation tuple, i.e., row in $R$, we set the columns corresponding to the first and second synset to 1 and -1, respectively. The values of $R$ are not updated during training. We use a squared error function and 0 as target value. This forces the system to find similar values for related synsets. Formally, the WN relation constraints are:

$$\operatorname*{argmin}_{E^{(d)}} \; \| R E^{(d)} w^{(d)} \| \quad \forall d \qquad (25)$$

             noun     verb     adj      adv
hypernymy    84,505   13,256   0        0
antonymy     2,154    1,093    4,024    712
similarity   0        0        21,434   0
verb group   0        1,744    0        0

Table 1: # of WN relations by part-of-speech
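The per-dimension formulation (Eqs. 18-20, 24 and 25) can be checked on a tiny hand-built example. The sketch below assumes two words (suit, lawsuit) and two synsets, uses dense NumPy arrays instead of sparse ones, and picks values so that all three constraints are met exactly; the function names `relation_matrix` and `constraint_losses` are hypothetical.

```python
import numpy as np

def relation_matrix(pairs, num_synsets):
    """One row per WN relation tuple: +1 for the first synset, -1 for the second."""
    R = np.zeros((len(pairs), num_synsets))
    for row, (first, second) in enumerate(pairs):
        R[row, first] = 1.0
        R[row, second] = -1.0
    return R

def constraint_losses(E, D, w, R):
    """E: |S| x |W|, D: |W| x |S|, w: one embedding dimension of all words."""
    s = E @ w                                     # Eq. 18: encode words into synsets
    synset = np.linalg.norm(D @ s - w)            # Eq. 20: reconstruction error
    lexeme = np.linalg.norm(E * w - (D * s).T)    # Eq. 24: E diag(w) - (D diag(s))^T
    relation = np.linalg.norm(R @ s)              # Eq. 25: related synsets should match
    return synset, lexeme, relation

# Words: 0 = suit, 1 = lawsuit. Synsets: 0 = lawsuit sense, 1 = clothes sense.
w = np.array([2.0, 1.0])                          # toy values for one dimension d
E = np.array([[0.25, 1.0],                        # columns sum to 1 (Eq. 6)
              [0.75, 0.0]])                       # zero: lawsuit has no clothes lexeme
D = np.array([[1.0 / 3.0, 1.0],                   # columns sum to 1 (Eq. 13)
              [2.0 / 3.0, 0.0]])
R = relation_matrix([(0, 1)], num_synsets=2)      # purely illustrative relation tuple

synset_loss, lexeme_loss, relation_loss = constraint_losses(E, D, w, R)
```

In this toy setting all three losses are zero, so the example mainly serves to show which matrix shapes and index orders the equations imply; real training would start from non-optimal $E^{(d)}$ and $D^{(d)}$ and reduce these losses by gradient descent.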


2.5 Implementation 


Our training objective is minimization of the sum of synset constraints (Eq. 20), weighted by $\alpha$, the lexeme constraints (Eq. 24), weighted by $\beta$, and the WN relation constraints (Eq. 25), weighted by $1 - \alpha - \beta$.
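The weighting scheme can be written down directly. The following sketch assumes the three constraint losses have already been computed as scalars; `total_objective` and `weight_grid` are hypothetical helper names, and the grid mirrors the step size of .1 used for the grid search in Section 3.1.

```python
def total_objective(synset_loss, lexeme_loss, relation_loss, alpha=0.33, beta=0.33):
    """Weighted sum of the three constraint losses (weights sum to 1)."""
    if alpha < 0 or beta < 0 or alpha + beta > 1:
        raise ValueError("weights must be non-negative and sum to at most 1")
    gamma = 1.0 - alpha - beta            # weight of the WN relation constraints
    return alpha * synset_loss + beta * lexeme_loss + gamma * relation_loss

def weight_grid(steps=10):
    """All (alpha, beta) pairs on a regular grid with alpha + beta <= 1."""
    return [(a / steps, b / steps)
            for a in range(steps + 1)
            for b in range(steps + 1 - a)]

grid = weight_grid()                      # step size 0.1
```

Enumerating the grid with integers and dividing afterwards avoids accumulating floating-point error when stepping through the weights.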

The training objective cannot be solved analytically because it is subject to constraints Eq. 8, Eq. 9, Eq. 15 and Eq. 16. We therefore use backpropagation. We do not use regularization since we found that all learned weights are in [-2, 2].

Figure 1: A small subgraph of WordNet. The circles are intended to show four different embedding dimensions. These dimensions are treated as independent. The synset constraints align the input and the output layer. The lexeme constraints align the second and fourth layers.

AutoExtend is implemented in MATLAB. We run 1000 iterations of gradient descent. On an Intel Xeon CPU E7-8857 v2 3.00GHz, one iteration on one dimension takes less than a minute because the gradient computation ignores zero entries in the matrix.


2.6 Column normalization 

Our model is based on the premise that a word is the sum of its lexemes (Eq. 1). From the definition of $E^{(i,j)}$, we derived that $E \in \mathbb{R}^{|S| \times n \times |W| \times n}$ is normalized over the first dimension (Eq. 8). So $E^{(d)} \in \mathbb{R}^{|S| \times |W|}$ is also normalized over the first dimension. In other words, $E^{(d)}$ is a column normalized matrix. Another premise of the model is that a synset is the sum of its lexemes. Therefore, $D^{(d)}$ is also column normalized. A simple way to implement this is to start the computation with column normalized matrices and normalize them again after each iteration as long as the error function still decreases. When the error function starts increasing, we stop normalizing the matrices and continue with a normal gradient descent. This respects that while $E^{(d)}$ and $D^{(d)}$ should be column normalized in theory, there are a lot of practical issues that prevent this, e.g., OOV words.
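Column normalization itself is a one-line operation. The sketch below is an illustration in NumPy rather than the MATLAB implementation; leaving all-zero columns untouched (for example, columns of words without any lexeme) is an assumption made here to avoid division by zero.

```python
import numpy as np

def column_normalize(M, eps=1e-12):
    """Rescale every column of M so that its entries sum to 1."""
    col_sums = M.sum(axis=0, keepdims=True)
    # Assumption: columns whose sum is (almost) zero are left unchanged.
    safe_sums = np.where(np.abs(col_sums) < eps, 1.0, col_sums)
    return M / safe_sums

M = np.array([[1.0, 2.0, 0.0],
              [3.0, 2.0, 0.0]])
N = column_normalize(M)
```

In a training loop following the schedule described above, this function would be applied to $E^{(d)}$ and $D^{(d)}$ after each gradient step until the error first increases, and skipped from then on.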



           WordNet    ∩ word2vec
words      147,478    54,570
synsets    117,791    73,844
lexemes    207,272    106,167

Table 2: # of items in WordNet and after intersection with word2vec vectors


3 Data, experiments and evaluation 


We downloaded 300-dimensional embeddings for 3,000,000 words and phrases trained on Google News, a corpus of $\approx 10^{11}$ tokens, using word2vec CBOW (Mikolov et al., 2013c). Many words in the word2vec vocabulary are not in WordNet, e.g., inflected forms (cars) and proper nouns (Tony Blair). Conversely, many WordNet lemmas are not in the word2vec vocabulary, e.g., 42 (digits were converted to 0). This results in a number of empty synsets (see Table 2). Note however that AutoExtend can produce embeddings for empty synsets because we use WN relation constraints in addition to synset and lexeme constraints.

We run AutoExtend on the word2vec vectors. As we do not know anything about a suitable weighting for the three different constraints, we set $\alpha = \beta = 0.33$. Our main goal is to produce compatible embeddings for lexemes and synsets. Thus, we can compute nearest neighbors across all three data types as shown in Figure 2.
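Because words, lexemes and synsets share one vector space, a single cosine ranking covers all three types. The following sketch uses the W/, L/ and S/ prefixes of Figure 2 as dictionary keys; the two-dimensional vectors are invented toy values and `nearest_neighbors` is a hypothetical helper.

```python
import numpy as np

def nearest_neighbors(query, items, k=3):
    """Return the k items most similar to `query` by cosine, any data type."""
    names = [name for name in items if name != query]
    M = np.stack([items[name] for name in names])
    q = items[query]
    sims = M @ q / (np.linalg.norm(M, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)[:k]          # highest cosine first
    return [names[i] for i in order]

# Toy vectors for a word (W/), a synset (S/) and a lexeme (L/).
items = {
    "W/suit":    np.array([1.0, 1.0]),
    "S/lawsuit": np.array([1.0, 0.8]),
    "L/garment": np.array([0.2, 1.0]),
    "W/banana":  np.array([-1.0, 0.0]),
}
neighbors = nearest_neighbors("W/suit", items, k=2)
```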

We evaluate the embeddings on WSD and on similarity performance. Our results depend directly on the quality of the underlying word embeddings, in our case word2vec embeddings. We would expect even better evaluation results as word representation learning methods improve. Using a new and improved set of underlying embeddings is simple: it only requires switching the input file that contains the word embeddings.


3.1 Word Sense Disambiguation 

For WSD we use the shared tasks of Senseval-2 (Kilgarriff, 2001) and Senseval-3 (Mihalcea et al., 2004) and a system named IMS (Zhong and Ng, 2010). Senseval-2 contains 139, Senseval-3 57 different words. They provide 8,611 and 8,022 training instances and 4,328 and 3,944 test instances, respectively. For the system, we use the same setting as in the original paper. Preprocessing consists of sentence splitting, tokenization, POS tagging and lemmatization; the classifier is a linear SVM. In our experiments (Table 3), we run IMS with each feature set by itself to assess the relative strengths of feature sets (lines 1-7) and on feature set combinations to determine which combination is best for WSD (lines 8, 12-15).

nearest neighbors of W/suit
S/suit (businessman), L/suit (businessman), L/accomodate, S/suit (be acceptable), L/suit (be acceptable), L/lawsuit, W/lawsuit, S/suit (playing card), L/suit (playing card), S/suit (petition), S/lawsuit, W/countersuit, W/complaint, W/counterclaim

nearest neighbors of W/lawsuit
L/lawsuit, S/lawsuit, S/countersuit, L/countersuit, W/countersuit, W/suit, W/counterclaim, S/counterclaim (n), L/counterclaim (n), S/counterclaim (v), L/counterclaim (v), W/sue, S/sue (n), L/sue (n)

nearest neighbors of S/suit-of-clothes
L/suit-of-clothes, S/zoot-suit, L/zoot-suit, W/zoot-suit, S/garment, L/garment, S/dress, S/trousers, L/pinstripe, L/shirt, W/tuxedo, W/gabardine, W/tux, W/pinstripe

Figure 2: Five nearest word (W/), lexeme (L/) and synset (S/) neighbors for three items, ordered by cosine

IMS implements three standard WSD feature 
sets: part of speech (POS), surrounding word and 
local collocation (lines 1-3). 

Let $w$ be an ambiguous word with $k$ senses. The three feature sets on lines 5-7 are based on the AutoExtend embeddings $s^{(j)}$, $1 \le j \le k$, of the $k$ synsets of $w$ and the centroid $c$ of the sentence in which $w$ occurs. The centroid is simply the sum of all word2vec vectors of the words in the sentence, excluding stop words.

The S-cosine feature set consists of the $k$ cosines of centroid and synset vectors:

$$\langle \cos(c, s^{(1)}), \cos(c, s^{(2)}), \ldots, \cos(c, s^{(k)}) \rangle$$

The S-product feature set consists of the $nk$ element-wise products of centroid and synset vectors:

$$\langle c_1 s^{(1)}_1, \ldots, c_n s^{(1)}_n, \ldots, c_1 s^{(k)}_1, \ldots, c_n s^{(k)}_n \rangle$$

where $c_i$ (resp. $s^{(j)}_i$) is element $i$ of $c$ (resp. $s^{(j)}$). The idea is that we let the SVM estimate how important each dimension is for WSD instead of giving all equal weight as in S-cosine.

The S-raw feature set simply consists of the $n(k + 1)$ elements of centroid and synset vectors:

$$\langle c_1, \ldots, c_n, s^{(1)}_1, \ldots, s^{(1)}_n, \ldots, s^{(k)}_1, \ldots, s^{(k)}_n \rangle$$

Our main goal is to determine if AutoExtend features improve WSD performance when added to standard WSD features. To make sure that improvements we get are not solely due to the power of word2vec, we also investigate a simple word2vec baseline. For S-product, the AutoExtend feature set that performs best in the experiment (cf. lines 6 and 14), we test the alternative word2vec-based S_naive-product feature set. It has the same definition as S-product except that we replace the synset vectors $s^{(j)}$ with naive synset vectors, defined as the sum of the word2vec vectors of the words that are members of synset $j$.
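The three synset-based feature sets differ only in how the centroid and the synset vectors are combined. A minimal sketch, assuming the centroid `c` and the `k` synset vectors are already available as NumPy arrays (the function names are illustrative, and the integration into IMS is not shown):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def s_cosine(c, synsets):
    """k features: cosine of the centroid with each synset vector."""
    return np.array([cosine(c, s) for s in synsets])

def s_product(c, synsets):
    """n*k features: element-wise products of centroid and synset vectors."""
    return np.concatenate([c * s for s in synsets])

def s_raw(c, synsets):
    """n*(k+1) features: the centroid followed by all synset vectors."""
    return np.concatenate([c] + list(synsets))

# Toy example with n = 3 dimensions and k = 2 senses.
c = np.array([1.0, 2.0, 0.0])
synsets = [np.array([1.0, 2.0, 0.0]), np.array([0.0, 1.0, 1.0])]

features_cosine = s_cosine(c, synsets)
features_product = s_product(c, synsets)
features_raw = s_raw(c, synsets)
```

Swapping in naive synset vectors (sums of member word vectors) for `synsets` in `s_product` yields the S_naive-product baseline described above.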

Lines 1-7 in Table 3 show the performance of each feature set by itself. We see that the synset feature sets (lines 5-7) have a comparable performance to standard feature sets. S-product is the strongest of them.

Lines 8-16 show the performance of different feature set combinations. MFS (line 8) is the most frequent sense baseline. Lines 9 and 10 are the winners of Senseval. The standard configuration of IMS (line 11) uses the three feature sets on lines 1-3 (POS, surrounding word, local collocation) and achieves an accuracy of 65.2% on the English lexical sample task of Senseval-2 and 72.3% on Senseval-3.¹ Lines 12-16 add one additional feature set to the IMS system on line 11; e.g., the system on line 14 uses POS, surrounding word, local collocation and S-product feature sets. The system on line 14 outperforms all previous systems, most of them significantly. While S-raw performs quite reasonably as a feature set alone, it hurts the performance when used as an additional feature set. As this is the feature set that contains the largest number of features ($n(k + 1)$), overfitting is the likely reason. Conversely, S-cosine only adds $k$ features and therefore may suffer from underfitting.²

We do a grid search (step size .1) for optimal values of $\alpha$ and $\beta$, optimizing the average score of Senseval-2 and Senseval-3. The best performing feature set combination is S_optimized-product with $\alpha = 0.2$ and $\beta = 0.5$, with only a small improvement (line 16).

The main result of this experiment is that we achieve an improvement of more than 1% in WSD performance when using AutoExtend.

¹ Zhong and Ng (2010) report accuracies of 65.3% / 72.6% for this configuration.

² In Table 3 and Table 4, results significantly worse than the best (bold) result in each column carry a marker, with separate symbols for α = .05 and α = .10 (one-tailed Z-test).

                                  Senseval-2   Senseval-3
IMS feature sets
 1  POS                           53.6         58.0
 2  surrounding word              57.6         65.3
 3  local collocation             58.7         64.7
 4  S_naive-product               56.5         62.2
 5  S-cosine                      55.5         60.5
 6  S-product                     58.3         64.3
 7  S-raw                         56.8         63.1
system comparison
 8  MFS                           47.6†        55)?
 9  Rank 1 system                 64.2†        72.9
10  Rank 2 system                 63.8†        72.6
11  IMS                           65.2*        72.3*
12  IMS + S_naive-prod.           62.6*        69.4*
13  IMS + S-cosine                65.1*        72.4*
14  IMS + S-product               66.5         73.6
15  IMS + S-raw                   62.1*        66.8*
16  IMS + S_optimized-prod.       66.6         73.6

Table 3: WSD accuracy for different feature sets and systems. Best result (excluding line 16) in each column in bold.


3.2 Synset and lexeme similarity 

We use SCWS (Huang et al., 2012) for the similarity evaluation. SCWS provides not only isolated words and corresponding similarity scores, but also a context for each word. SCWS is based on WordNet, but the information as to which synset a word in context came from is not available. However, the dataset is the closest we could find for sense similarity. Synset and lexeme embeddings are obtained by running AutoExtend. Based on the results of the WSD task, we set $\alpha = 0.2$ and $\beta = 0.5$. Lexeme embeddings are the natural choice for this task as human subjects are provided with two words and a context for each and then have to assign a similarity score. But for completeness, we also run experiments for synsets.

For each word, we compute a context vector $c$ by adding all word vectors of the context, excluding the test word itself. Following Reisinger and Mooney (2010), we compute the lexeme (resp. synset) vector $l$ either as the simple average of the lexeme (resp. synset) vectors $l^{(i,j)}$ (method AvgSim, no dependence on $c$ in this case) or as the average of the lexeme (resp. synset) vectors weighted by cosine similarity to $c$ (method AvgSimC).
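The two averaging schemes, as described in the paragraph above, can be sketched as follows. The function names and the toy vectors are illustrative assumptions; the final similarity score between two test words would then be the cosine of their two resulting vectors.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def avg_vector(lexeme_vectors):
    """AvgSim: plain average of a word's lexeme vectors (context is ignored)."""
    return np.mean(lexeme_vectors, axis=0)

def avg_vector_in_context(lexeme_vectors, c):
    """AvgSimC: average of the lexeme vectors weighted by their cosine to c."""
    weights = np.array([cosine(v, c) for v in lexeme_vectors])
    return (weights[:, None] * lexeme_vectors).mean(axis=0)

# Toy word with two lexemes and a context that matches only the first one.
lexemes = np.array([[1.0, 0.0],
                    [0.0, 1.0]])
context = np.array([1.0, 0.0])

plain = avg_vector(lexemes)
weighted = avg_vector_in_context(lexemes, context)
```

Since cosine similarity is invariant to positive rescaling, dividing the weighted sum by the number of lexemes rather than by the sum of the weights does not change the final score as long as the weights are not all negative.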

Table 4 shows that AutoExtend lexeme embeddings (line 7) perform better than previous work, including (Huang et al., 2012) and (Tian et al., 2014). Lexeme embeddings perform better than synset embeddings (lines 7 vs. 6), presumably because using a representation that is specific to the actual word being judged is more precise than using a representation that also includes synonyms.

A simple baseline is to use the underlying word2vec embeddings directly (line 5). In this case, there is only one embedding, so there is no difference between AvgSim and AvgSimC. It is interesting that even if we do not take the context into account (method AvgSim) the lexeme embeddings outperform the original word embeddings. As AvgSim simply adds up all lexemes of a word, this is equivalent to the constraint we proposed in the beginning of the paper (Eq. 1). Thus, replacing a word's embedding by the sum of the embeddings of its senses could generally improve the quality of embeddings (cf. Huang et al. (2012) for a similar point). We will leave a deeper evaluation of this topic for future work.

                                  AvgSim   AvgSimC
1  Huang et al. (2012)            62.8*    65.7*
2  Tian et al. (2014)             -        65.4*
3  Neelakantan et al. (2014)      67.2     69.3
4  Chen et al. (2014)             66.2*    68.9
5  words (word2vec)               66.6*    66.6†
6  synsets                        62.6*    63.7*
7  lexemes                        68.9     69.8

Table 4: Spearman correlation ($\rho \times 100$) on SCWS. Best result per column in bold.


4 Analysis 

We first look at the impact of the parameters $\alpha$, $\beta$ (Section 2.5) that control the weighting of synset constraints vs lexeme constraints vs WN relation constraints. We investigate the impact for three different tasks. WSD-alone: accuracy of IMS (average of Senseval-2 and Senseval-3) if only S-product is used as a feature set (line 6 in Table 3). WSD-additional: accuracy of IMS (average of Senseval-2 and Senseval-3) if S-product is used together with the feature sets POS, surrounding word and local collocation (line 14 in Table 3). SCWS: Spearman correlation on SCWS (line 7 in Table 4).

For WSD-alone (Figure 3, center), the best performing weightings (red) all have high weights for WN relations and are therefore at the top of the triangle. Thus, WN relations are very important for WSD-alone and adding more weight to the synset and lexeme constraints does not help. However, all three constraints are important in WSD-additional: the red area is in the middle (corresponding to nonzero weights for all three constraints) in the left panel of Figure 3. Apparently, strongly weighted lexeme and synset constraints enable learning of representations that in their interaction with standard WSD feature sets like local collocation increase WSD performance. For SCWS (right panel), we should not put too much weight on WN relations as they artificially bring related, but not similar lexemes together. So the maximum for this task is located in the lower part of the triangle.

The main result of this analysis is that AutoExtend never achieves its maximum performance when using only one set of constraints. All three constraints are important - synset, lexeme and WN relation constraints - with different weights for different applications.

We also analyzed the impact of the four different WN relations (see Table 1) on performance. In Table 3 and Table 4, all four WN relations are used together. We found that any combination of three relation types performs worse than using all four together. A comparison of different relations must be done carefully as they differ in the POS they affect and in quantity (see Table 1). In general, relation types with more relations outperformed relation types with fewer relations.

Finally, the relative weighting of $l^{(i,j)}$ and $\bar{l}^{(i,j)}$ when computing lexeme embeddings is also a parameter that can be tuned. We use simple averaging ($\theta = 0.5$) for all experiments reported in this paper. We found only small changes in performance for $0.2 \le \theta \le 0.8$.


5 Resources other than WordNet 


AutoExtend is broadly applicable to lexical and 
knowledge resources that have certain properties. 
While we only run experiments with WordNet in 
this paper, we will briefly address other resources. 


For Freebase (Bollacker et al., 2008), we could replace the synsets with Freebase entities. Each entity has several aliases, e.g. Barack Obama, President Obama, Obama. The role of words in WordNet would correspond to these aliases in Freebase. This will give us the synset constraint, as well as the lexeme constraint of the system. Relations are given by Freebase types; e.g., we can add a constraint that entity embeddings of the type “President of the US” should be similar.

To explore multilingual word embeddings we require the word embeddings of different languages to live in the same vector space, which can easily be achieved by training a transformation matrix $L$ between two languages using known translations (Mikolov et al., 2013b). Let $X$ be a matrix where each row is a word embedding in language 1 and $Y$ a matrix where each row is a word embedding in language 2. For each row, the words of $X$ and $Y$ are translations of each other. We then want to minimize the following objective:

$$\operatorname*{argmin}_{L} \; \| X L - Y \| \qquad (26)$$

We can use gradient descent to solve this, but a matrix inversion will run faster. The matrix $L$ is given by:

$$L = (X^{T} X)^{-1} (X^{T} Y) \qquad (27)$$
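Eq. 27 is the normal-equation solution of a linear least-squares problem with embeddings stored as rows, i.e., it finds the $L$ for which $XL$ is closest to $Y$. The sketch below checks this on synthetic data (random matrices, not real embeddings) and compares it with NumPy's least-squares solver, which is usually the numerically safer choice.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))        # 50 toy "language 1" embeddings, one per row
L_true = rng.normal(size=(4, 4))    # hypothetical ground-truth transformation
Y = X @ L_true                      # matching toy "language 2" embeddings

# Eq. 27: solve (X^T X) L = X^T Y instead of forming the inverse explicitly.
L_normal = np.linalg.solve(X.T @ X, X.T @ Y)

# Equivalent solution via a dedicated least-squares routine.
L_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

# New language 1 vectors are mapped into the language 2 space by x @ L.
mapped = X[:1] @ L_normal
```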


The matrix $L$ can be used to transform unknown embeddings into the new vector space, which enables us to use a multilingual WordNet like BabelNet (Navigli and Ponzetto, 2010) to compute synset embeddings. We can add cross-linguistic relationships to our model, e.g., aligning German and English synset embeddings of the same concept.

6 Related Work 


Rumelhart et al. (1988) introduced distributed word representations, usually called word embeddings today. There has been a resurgence of work on them recently (e.g., Bengio et al. (2003), Mnih and Hinton (2007), Collobert et al. (2011), Mikolov et al. (2013a), Pennington et al. (2014)). These models produce only a single embedding for each word. All of them can be used as input for AutoExtend.

There are several approaches to finding embeddings for senses, variously called meaning, sense and multiple word embeddings. Schütze (1998) created sense representations by clustering context representations derived from co-occurrence. The representation of a sense is simply the centroid of its cluster. Huang et al. (2012) improved this by learning single-prototype embeddings before performing word sense discrimination on them. Bordes et al. (2011) created similarity measures for relations in WordNet and Freebase to learn entity embeddings. An energy based model was proposed by Bordes et al. (2012) to create disambiguated meaning embeddings and Neelakantan et al. (2014) and Tian et al. (2014) extended the Skip-gram model (Mikolov et al., 2013a) to learn multiple word embeddings. While these embeddings can correspond to different word senses, there is no clear mapping between them and a lexical resource like WordNet. Chen et al. (2014) also modified word2vec to learn sense embeddings, each corresponding to a WordNet synset. They use glosses to initialize sense embeddings, which in turn can be used for WSD. The sense disambiguated data can again be used to improve sense embeddings.

[Figure 3 panels, left to right: WSD-additional, WSD-alone, SCWS]

Figure 3: Performance of different weightings of the three constraints (WN relations: top, lexemes: left, synsets: right) on the three tasks WSD-additional, WSD-alone and SCWS. “x” indicates the maximum; “o” indicates a local minimum.

This prior work needs a training step to learn 
embeddings. In contrast, we can “AutoExtend” 
any set of given word embeddings - without 
(re)training them. 

There is only little work on taking existing word embeddings and producing embeddings in the same space. Labutov and Lipson (2013) tuned existing word embeddings in supervised training, not to create new embeddings for senses or entities, but to get better predictive performance on a task while not changing the space of embeddings.

Lexical resources have also been used to improve word embeddings. In the Relation Constrained Model, Yu and Dredze (2014) use word2vec to learn embeddings that are optimized to predict a related word in the resource, with good evaluation results. Bian et al. (2014) used not only semantic, but also morphological and syntactic knowledge to compute more effective word embeddings.

Another interesting approach to create sense specific word embeddings uses bilingual resources (Guo et al., 2014). The downside of this approach is that parallel data is needed.

We used the SCWS dataset for the word similarity task, as it provides a context. Other frequently used datasets are WordSim-353 (Finkelstein et al., 2001) or MEN (Bruni et al., 2014).

And while we use cosine to compute similarity between synsets, there are also a lot of similarity measures that only rely on a given resource, mostly WordNet. These measures are often functions that depend on the provided information like gloss or the topology like shortest-path. Examples include (Wu and Palmer, 1994) and (Leacock and Chodorow, 1998); Blanchard et al. (2005) give a good overview.


7 Conclusion 

We presented AutoExtend, a flexible method to learn synset and lexeme embeddings from word embeddings. It is completely general and can be used for any other set of embeddings and for any other resource that imposes constraints of a certain type on the relationship between words and other data types. Our experimental results show that AutoExtend achieves state-of-the-art performance on word similarity and word sense disambiguation. Along with this paper, we will publish AutoExtend for extending word embeddings to other data types; the lexeme and synset embeddings used in the experiments; and the code needed to replicate our WSD evaluation.